We will work on scRNA-seq data of mouse gastrulation and early organogenesis from Pijuan-Sala et al., 2019. This application provides an interactive interface that allows users to validate their own analysis on this data. You can reach the original data and related scripts from the Github page.
Gastrulation is a phase early in the embryonic development of most animals, during which the single-layered blastula is reorganized into a multilayered structure known as the gastrula (Wikipedia, 2020-06-02).
Let’s start with including required libraries.
suppressPackageStartupMessages({
library(Seurat)
library(ggplot2)
library(dplyr)
library(data.table)
library(cowplot)
set.seed(1)
})
In the previous practical, we have learned the essential preprocessing steps for working with scRNA-seq. Here we will start working with the preprocessed version of the mouse gastrulation scRNA-seq data in order to save time. We will load the pre-saved Seurat object of this data.
mgsc <- readRDS('data/gastrulation/mgsc.rds')
mgsc
## An object of class Seurat
## 29452 features across 116312 samples within 1 assay
## Active assay: RNA (29452 features, 0 variable features)
Reminder: number of rows = genes = features, and number of columns = cells = samples
# let's see the available slots
slotNames(mgsc)
## [1] "assays" "meta.data" "active.assay" "active.ident" "graphs"
## [6] "neighbors" "reductions" "project.name" "misc" "version"
## [11] "commands" "tools"
Now let us look at to the metadata table of this data which contains an overview of the samples.
# see it is empty for now
head(mgsc@meta.data, 5)
## data frame with 0 columns and 5 rows
# now let's load the information to a table
metadata <- fread('data/gastrulation/sample_metadata.txt.gz') %>% .[stripped==FALSE & doublet==FALSE]
# and add metadata to our seurat object
mgsc <- AddMetaData(mgsc, metadata = data.frame(metadata, row.names = metadata$cell))
# now let's see once more
head(mgsc@meta.data, 5)
## cell barcode sample stage sequencing.batch doublet stripped
## cell_1 cell_1 AAAGGCCTCCACAA 1 E6.5 1 FALSE FALSE
## cell_2 cell_2 AACAAACTCGCCTT 1 E6.5 1 FALSE FALSE
## cell_5 cell_5 AACAGAGAATCAGC 1 E6.5 1 FALSE FALSE
## cell_6 cell_6 AACATATGAATCGC 1 E6.5 1 FALSE FALSE
## cell_8 cell_8 AACCGATGGCTTCC 1 E6.5 1 FALSE FALSE
## celltype umapX umapY celltype2 celltype3
## cell_1 Epiblast -10.227546 -2.8816875 Epiblast Epiblast-PS
## cell_2 Primitive_Streak -6.625458 0.1089605 Primitive_Streak Epiblast-PS
## cell_5 ExE_ectoderm 10.061009 -0.0293132 ExE_ectoderm ExE_ectoderm
## cell_6 Epiblast -10.454418 -0.2694517 Epiblast Epiblast-PS
## cell_8 Epiblast -11.047206 -2.2052687 Epiblast Epiblast-PS
Now we are able to see the annonations for each cell, e.g. cell types. There is also a column named stage column which shows the embryonic day the cells were sequenced. As our dataset is quite large, we will work on a subset of cells that belong to stage E6.75 in order to speed up our response time.
mgsc_subset <- mgsc[ , mgsc@meta.data$stage=='E6.75']
mgsc_subset
## An object of class Seurat
## 29452 features across 2075 samples within 1 assay
## Active assay: RNA (29452 features, 0 variable features)
# now lets save this subset
saveRDS(mgsc_subset, file = "data/gastrulation/mgsc_e675.rds")
mgsc <- mgsc_subset
This is where you are starting from! :) Now you can go ahead and load mgsc_e675.rds object to start working! Make sure that the object you loaded has same numbers of columns and rows with E6.75 Seurat object.
Q1: Let’s warm up! Can you try to subset cells in locations \(1, 23, 515\) and genes in locations \(21, 44, 116\), respectively?
## An object of class Seurat
## 29452 features across 3 samples within 1 assay
## Active assay: RNA (29452 features, 0 variable features)
## An object of class Seurat
## 3 features across 2075 samples within 1 assay
## Active assay: RNA (3 features, 0 variable features)
We can use rownames and colnames to see the names of genes and cells.
# rownames = gene names
head(rownames(mgsc))
## [1] "ENSMUSG00000051951" "ENSMUSG00000089699" "ENSMUSG00000102343"
## [4] "ENSMUSG00000025900" "ENSMUSG00000025902" "ENSMUSG00000104328"
# colnames = sample/cell names
head(colnames(mgsc))
## [1] "cell_5456" "cell_5457" "cell_5458" "cell_5459" "cell_5460" "cell_5461"
Q2: Observe the count data of the first 30 cells for genes “ENSMUSG00000051951” and “ENSMUSG00000033845”. (Hint: use GetAssayData function. )
## 2 x 30 sparse Matrix of class "dgCMatrix"
## [[ suppressing 30 column names 'cell_5456', 'cell_5457', 'cell_5458' ... ]]
##
## ENSMUSG00000051951 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## ENSMUSG00000033845 1 3 8 2 4 2 2 3 5 5 3 6 1 3 2 2 5 3 4 1 2 4 5 2 3 4 6 7 5 6
The . values in the matrix represent \(0\)s. Since most values in an scRNA-seq matrix are \(0\), Seurat uses a sparse-matrix representation to speed up calculations and reduce memory usage.
Q3: Our feature set (i.e. number of genes) are quite large! What kind of features do you think should be in the dataset?
Seurat implements FindVariableFeatures to model the mean-variance relationship in single-cell data and returns 2000 features per dataset by default. We will use 1000 features to reduce computation time.
mgsc <- FindVariableFeatures(mgsc, selection.method = "vst", nfeatures = 1000)
options(repr.plot.width=12, repr.plot.height=6)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(mgsc), 10)
# plot variable features
plot1 <- VariableFeaturePlot(mgsc)
LabelPoints(plot = plot1, points = top10, repel = TRUE, xnudge = 0, ynudge = 0)
## Warning: Transformation introduced infinite values in continuous x-axis
head(HVFInfo(mgsc)[VariableFeatures(mgsc), ], 5)
## mean variance variance.standardized
## ENSMUSG00000032083 10.008193 971.0756 24.49617
## ENSMUSG00000024990 3.765301 210.4092 22.38176
## ENSMUSG00000061808 7.784578 546.4217 21.09222
## ENSMUSG00000024503 6.853976 419.2819 19.63029
## ENSMUSG00000024391 3.319036 112.9888 15.76248
Q4: Seurat implements the ScaleData function to scale the data. Do you remember why we need this step?
mgsc <- ScaleData(mgsc) #Vector of features names to scale/center. Default is variable features.
## Centering and scaling data matrix
The results are stored inmgsc[["RNA"]]@scale.data.
mgsc[["RNA"]]@scale.data[1:5,1:5]
## cell_5456 cell_5457 cell_5458 cell_5459 cell_5460
## ENSMUSG00000025902 -0.3250561 -0.3250561 -0.3250561 -0.3250561 -0.3250561
## ENSMUSG00000042182 -0.0428776 -0.0428776 -0.0428776 -0.0428776 -0.0428776
## ENSMUSG00000026124 -0.3533363 -0.3533363 -0.3533363 -0.3533363 -0.3533363
## ENSMUSG00000008136 -0.3580923 -0.3580923 -0.3580923 -0.3580923 -0.3580923
## ENSMUSG00000025993 -0.3715563 -0.3715563 -0.3715563 -0.3715563 -0.3715563
We will RunPCA function of the Seurat.
Q5: Pay attention to the features argument. How many features will PCA work on?
mgsc <- RunPCA(mgsc, npcs = 100, features = VariableFeatures(object = mgsc))
We can explore the constructed embeddings via mgsc@reductions.
slotNames(mgsc@reductions[["pca"]])
## [1] "cell.embeddings" "feature.loadings"
## [3] "feature.loadings.projected" "assay.used"
## [5] "global" "stdev"
## [7] "key" "jackstraw"
## [9] "misc"
dim(mgsc@reductions[["pca"]]@cell.embeddings)
## [1] 2075 100
mgsc@reductions[["pca"]]@cell.embeddings[1:5,1:5]
## PC_1 PC_2 PC_3 PC_4 PC_5
## cell_5456 -7.525323 -2.044025 -1.7048600 -3.297759 -0.00652879
## cell_5457 -8.266647 -3.636625 0.6928462 -3.041104 -0.18143907
## cell_5458 -7.687170 -3.591576 -2.1696897 -3.157680 -0.03483741
## cell_5459 -7.256699 -2.657866 -1.7883639 -3.313753 -1.10164595
## cell_5460 -7.410198 -2.796104 -1.9718807 -4.048726 -0.79348906
# (1) we can visualize top genes associated with pca embeddings
VizDimLoadings(mgsc, dims = 1:2, reduction = "pca")
VizDimLoadings(mgsc, dims = 3:4, reduction = "pca")
Q6: Generate the following output: three 2D plots with the first six PCs and print them side by side. i.e. PC1-PC2, PC3-PC4, PC5-PC6. Utilize DimPlot function of Seurat.
Q7: Name that one method that we used to determine the dimensionality. Find the corresponding Seurat function and use it to print the following plot. How many dimensions do you suggest we keep?
Another alternative is to use JackStraw function. Here is how Seurat defines it:
Randomly permutes a subset of data, and calculates projected PCA scores for these ‘random’ genes. Then compares the PCA scores for the ‘random’ genes with the observed PCA scores to determine statistical signifance. End result is a p-value for each gene’s association with each principal component.
Q8: According to this definition, how do you think we identify important PCs?
mgsc <- JackStraw(mgsc, dims=50, num.replicate = 100)
mgsc <- ScoreJackStraw(mgsc, dims = 1:50)
JackStrawPlot(mgsc, dims = 1:20)
## Warning: Removed 15496 rows containing missing values (geom_point).
This technique, however, might be time-consuming - especially for larger datasets. Elbow plot is a common practice for its speed.
Seurat comprises RunTSNE which uses Rtsne library (which we previously worked with() as a default.
# we are using the PCA embeddings as our input
mgsc <- RunTSNE(mgsc, dims = 1:30, nthreads = 4, max_iter = 2000)
DimPlot(mgsc, reduction = "tsne")
mgsc@reductions
## $pca
## A dimensional reduction object with key PC_
## Number of dimensions: 100
## Projected dimensional reduction calculated: FALSE
## Jackstraw run: TRUE
## Computed using assay: RNA
##
## $tsne
## A dimensional reduction object with key tSNE_
## Number of dimensions: 2
## Projected dimensional reduction calculated: FALSE
## Jackstraw run: FALSE
## Computed using assay: RNA
Q9: Do you remember one of the important hyper-parameters of t-SNE? Print t-SNE plots using DimPlot and explore how structure of the printed data changes with this parameter. Keep dimension as \(30\). (Hint: reduction.name will help you save your embeddings that are produced by different algorithms. )
mgsc <- RunUMAP(mgsc, dims = 1:30)
DimPlot(mgsc, reduction = "umap")
Q10: How about the important parameters for UMAP? Print following UMAP plots using DimPlot and explore how structure of the printed data changes with these two parameters (you can print it for only one parameter). Choose dimensions between \(1:50\).
Homework: Plot the projections of the dataset with three of the dimensionality reduction techniques printed side by side and explore AugmentPlot.
For our subset of time point E6.5, we have the cell type annotation information. Although this is not usually the case in many computational problems, i.e. we start without knowing (or having a very vague idea of) how many clusters we will end up. For now let’s take advantage of the known annotations.
Q11: How many \(real\) classes we have? Reproduce the following plots colored by cell type. ( Hint: You can make use group.by argument in DimPlot to extract stored cluster IDs.)
Seurat does not support K-means? What are we going to do now!? :)
Q12: Which elements do we need to perform \(K-means\)? Try to reproduce the following plot (\(k=10\)). ( Hint: Make use of Embeddings function or utilize an information we previously used in the previous examples. You can save cluster IDs to mgsc@meta.data field. )
## [1] 2075 30
Do you remember our example from the lecture? Which \(k\)-means outcome did we look at the to decide number of clusters?
Q13: Print the following plot using the same embeddings in the previous example. (Hint: Lecture notes:))
Can we decide the number of clusters looking at this plot?
Seurat does not support hierarchical clustering too? But we can do it!!
Q14: Which elements do we need to perform hierarchical clustering? Try to reproduce the following dendogram. Use Euclidian as distance metric and ward.D2 as linkage method.
Q15: How about we print different clustering outcomes with \(k=10, 20, 40\) using t-SNE?
Seuratv3 adopts a graph-based clustering methodology much similar to PhenoGraph approach we discussed earlier.
FindNeighbors function performs the first steps: (1) build a kNN graph using Euclidean distance in PCA space and (2) update the edge weights based on the shared neighbors (Jaccard index) to construct a Shared Nearest Neighbor (SNN) graph.
Once the graph is constucted, we can use FindClusters function to apply a community detection algorithm to identify subgroups. Remember that, Seurat uses the Louvain algorithm as a default for this step.
mgsc <- FindNeighbors(mgsc, k.param = 20, dims = 1:30, reduction = "pca")
mgsc <- FindClusters(mgsc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2075
## Number of edges: 99889
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8377
## Number of communities: 4
## Elapsed time: 0 seconds
# Look at cluster IDs of the first 20 cells
head(Idents(mgsc), 20)
## cell_5456 cell_5457 cell_5458 cell_5459 cell_5460 cell_5461 cell_5462 cell_5463
## 0 0 0 0 0 0 1 0
## cell_5465 cell_5466 cell_5467 cell_5468 cell_5469 cell_5470 cell_5471 cell_5472
## 0 1 0 3 0 0 3 3
## cell_5473 cell_5474 cell_5475 cell_5476
## 2 3 3 0
## Levels: 0 1 2 3
FindClusters has a resolution parameter that sets the granularity of the downstream clustering, with increased values leading to a greater number of clusters.
The Seurat authors suggest that granularity values between 0.4-1.2 usually return good results for sc datasets of around 3K cells. For larger datasets, optimal resolution often increases for larger datasets.
Homework: Explore how different values of \(granularity\) affect the clustering results and reproduce the following plot. (Hint: Use Idents function of Seurat to save cluster ids. )
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2075
## Number of edges: 99889
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8658
## Number of communities: 4
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2075
## Number of edges: 99889
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.7710
## Number of communities: 6
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2075
## Number of edges: 99889
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.7075
## Number of communities: 9
## Elapsed time: 0 seconds